pOSKI: An Extensible Autotuning Framework to Perform Optimized SpMVs on Multicore Architectures
Abstract
We have developed pOSKI, the Parallel Optimized Sparse Kernel Interface, an autotuning framework that optimizes Sparse Matrix-Vector Multiply (SpMV) performance on emerging shared-memory multicore architectures. Our autotuning methodology extends previous work in the scientific computing community that targeted serial architectures. In addition to previously explored parallel optimizations, we find that load-balanced data decomposition is extremely important for achieving good parallel performance on the new generation of parallel architectures. Our best parallel configurations perform up to 9x faster than optimized serial codes on the AMD Santa Rosa architecture, 11.3x faster on the AMD Barcelona architecture, and 7.2x faster on the Intel Clovertown architecture.
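To make the load-balancing point concrete, here is a minimal sketch in Python. It is not pOSKI's interface, and the abstract does not say which decomposition scheme pOSKI uses. The sketch assumes a matrix in compressed sparse row (CSR) form and assumes that "load balanced" means giving each thread a contiguous block of rows with roughly the same number of nonzeros. The function names `spmv_csr` and `partition_rows_by_nnz` are hypothetical.

```python
from bisect import bisect_left


def spmv_csr(row_ptr, col_idx, vals, x, row_start=0, row_end=None):
    """Compute y = A @ x for rows [row_start, row_end) of a CSR matrix."""
    if row_end is None:
        row_end = len(row_ptr) - 1
    y = []
    for i in range(row_start, row_end):
        acc = 0.0
        # Nonzeros of row i live in positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y


def partition_rows_by_nnz(row_ptr, num_parts):
    """Return row boundaries for contiguous blocks with similar nonzero counts.

    Assumption for illustration: work per row is proportional to its
    number of nonzeros, so equal nonzero counts approximate equal work.
    """
    n_rows = len(row_ptr) - 1
    total_nnz = row_ptr[-1]
    bounds = [0]
    for p in range(1, num_parts):
        target = total_nnz * p / num_parts
        # First row boundary whose cumulative nonzero count reaches the target.
        cut = bisect_left(row_ptr, target)
        # Keep boundaries ordered and inside the matrix.
        cut = min(max(cut, bounds[-1]), n_rows)
        bounds.append(cut)
    bounds.append(n_rows)
    return bounds


# A 4x4 example whose first row holds half of all nonzeros.
row_ptr = [0, 4, 5, 6, 8]
col_idx = [0, 1, 2, 3, 1, 2, 0, 3]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x = [1.0, 1.0, 1.0, 1.0]

bounds = partition_rows_by_nnz(row_ptr, 2)

# Each block could be handed to a separate thread; here they run in sequence.
y = []
for start, end in zip(bounds[:-1], bounds[1:]):
    y.extend(spmv_csr(row_ptr, col_idx, vals, x, start, end))
```

Splitting this matrix evenly by row count would give the two blocks 5 and 3 nonzeros, while the nonzero-based split gives each block 4. On real matrices with highly skewed row lengths the gap between the two schemes can be much larger.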
Similar resources
Autotuning Sparse Matrix-Vector Multiplication for Multicore
Sparse matrix-vector multiplication (SpMV) is an important kernel in scientific and engineering computing. Straightforward parallel implementations of SpMV often perform poorly, and with the increasing variety of architectural features in multicore processors, it is getting more difficult to determine the sparse matrix data structure and corresponding SpMV implementation that optimize performan...
MATS: A Model-Driven Adaptive Tuning System for Parallel Workloads
Building software that can effectively utilize underlying hardware resources has been a perennial challenge for the high-performance computing community. In recent years, the HPC community has responded to this challenge by creating adaptive compilation systems that allow domain experts to automatically tune their code to different architectures; thus relieving some of the burden of manual reta...
Towards Autotuning of OpenMP Applications on Multicore Architectures
In this paper we describe an autotuning tool for optimization of OpenMP applications on highly multicore and multithreaded architectures. Our work was motivated by in-depth performance analysis of scientific applications and synthetic benchmarks on IBM Power 775 architecture. The tool provides an automatic code instrumentation of OpenMP parallel regions. Based on measurement of chosen hardware ...
A Fully Empirical Autotuned Dense QR Factorization for Multicore Architectures
Tuning numerical libraries has become more difficult over time, as systems get more sophisticated. In particular, modern multicore machines make the behaviour of algorithms hard to forecast and model. In this paper, we tackle the issue of tuning a dense QR factorization on multicore architectures. We show that it is hard to rely on a model, which motivates us to design a fully empirical approac...
Fully Empirical Autotuned QR Factorization For Multicore Architectures
Tuning numerical libraries has become more difficult over time, as systems get more sophisticated. In particular, modern multicore machines make the behaviour of algorithms hard to forecast and model. In this paper, we tackle the issue of tuning a dense QR factorization on multicore architectures. We show that it is hard to rely on a model, which motivates us to design a fully empirical approac...